Add 24 compressor #167
base: add-targets-and-ignore-support
Conversation
Very clean. LGTM after tests!
Overall the code looks simple. I'd like to reformulate the scope, though. Specifically, I'm not following why we are restricting to just 2:4 right now when we could easily expand this to handle all sparsity cases: detect whether a weight is in 2:4 format or some type of structured pruning, and fall back to unstructured if neither applies. cc @dsikka
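To make the suggestion concrete, here is a minimal sketch of what pattern detection could look like. This is illustrative only, not code from the PR; the function name `classify_sparsity` and the simplification of checking only 2:4 vs. unstructured (structured-pruning detection omitted) are my assumptions.

```python
import numpy as np

def classify_sparsity(weight: np.ndarray) -> str:
    """Classify a weight matrix's sparsity pattern (hypothetical helper).

    Returns "2:4" when every contiguous group of 4 elements along the
    last axis has at most 2 non-zeros, otherwise "unstructured".
    Structured-pruning detection is omitted for brevity.
    """
    if weight.shape[-1] % 4 != 0:
        return "unstructured"
    groups = (weight != 0).reshape(-1, 4)   # one row per group of 4
    nnz_per_group = groups.sum(axis=1)      # non-zeros in each group
    return "2:4" if (nnz_per_group <= 2).all() else "unstructured"

# Each group of 4 keeps at most 2 values -> classified as 2:4
w24 = np.array([[1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0, 4.0]])
# One group has 3 non-zeros -> classified as unstructured
wuns = np.array([[1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

print(classify_sparsity(w24))   # -> 2:4
print(classify_sparsity(wuns))  # -> unstructured
```

A dispatcher built on such a check could pick the right compressor automatically instead of hard-coding 2:4.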
testing?
This PR introduces the `Sparse24Compressor`, designed for 2:4 sparse models. The implementation is based on #182 and corresponds to Part 3 of the [Design Document](https://www.notion.so/Design-Document-24-Compressor-25ac643aee604c298f2bb12a6c220861?pvs=4).

Key Changes
- Introduces the `Sparse24Compressor` for handling 2:4 sparsity in models.
- Adds support for the `torch.float8e4m3` dtype.

Class Hierarchy
The `Sparse24Compressor` follows the established compressor class hierarchy.

File Structure
The `Sparse24Compressor` and associated logic are placed within the `sparse_compressors` module.

Verification Methodology
The `Sparse24Compressor` was tested using a comprehensive script that validates its behavior through the following steps:

1. **Load Model**: An uncompressed model is loaded from the Hugging Face model hub or a local directory.
2. **Compression**: The model is compressed using `ModelCompressor`, and the compressed version is saved.
3. **Decompression**: A new base model is initialized, and the compressed weights are decompressed using `ModelCompressor.decompress`.
4. **Parameter Validation**: Parameters in the decompressed model are verified to match the original uncompressed model.
5. **Inference Check**: The decompressed model is used to generate text, ensuring correctness and functionality.
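For intuition about what a 2:4 compressor stores, here is a standalone round-trip sketch: pack each group of 4 elements into its (up to) 2 non-zero values plus their in-group positions, then scatter them back. This is a toy illustration of the format, not the PR's actual `Sparse24Compressor` implementation; the helper names are hypothetical.

```python
import numpy as np

def compress_24(dense: np.ndarray):
    """Pack a 2:4-sparse matrix into values + in-group column indices.

    Assumes the input already satisfies the 2:4 pattern (at most 2
    non-zeros in every contiguous group of 4). Hypothetical helper.
    """
    groups = dense.reshape(-1, 4)
    values = np.zeros((groups.shape[0], 2), dtype=dense.dtype)
    idx = np.zeros((groups.shape[0], 2), dtype=np.int8)
    for g, row in enumerate(groups):
        nz = np.flatnonzero(row)[:2]        # positions of the kept values
        values[g, : len(nz)] = row[nz]
        idx[g, : len(nz)] = nz
    return values, idx, dense.shape

def decompress_24(values, idx, shape):
    """Inverse of compress_24: scatter values back into a dense matrix."""
    groups = np.zeros((values.shape[0], 4), dtype=values.dtype)
    for g in range(values.shape[0]):
        groups[g, idx[g]] = values[g]
    return groups.reshape(shape)

# Round trip: compress then decompress recovers the original matrix
w = np.array([[1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0, 4.0]], dtype=np.float32)
vals, idx, shape = compress_24(w)
restored = decompress_24(vals, idx, shape)
print(np.array_equal(restored, w))  # -> True
```

This mirrors the verification flow above at matrix scale: the real script does the same compress/decompress/compare cycle on whole model checkpoints.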
Note: the fp8 test can only run on GPUs with CUDA compute capability of at least 9.0.
Proof that it passes on the right device: